We present a new perspective on loss minimization and the recent notion of Omniprediction through the lens of Outcome Indistinguishability. For a collection of losses and a hypothesis class, omniprediction requires that a predictor provide a loss-minimization guarantee simultaneously for every loss in the collection compared to the best (loss-specific) hypothesis in the class. We present a generic template to learn predictors satisfying a guarantee we call Loss Outcome Indistinguishability. For a set of statistical tests (based on a collection of losses and a hypothesis class), a predictor is Loss OI if it is indistinguishable (according to the tests) from Nature's true probabilities over outcomes. By design, Loss OI implies omniprediction in a direct and intuitive manner. We simplify Loss OI further, decomposing it into a calibration condition plus multiaccuracy for a class of functions derived from the loss and hypothesis classes. By careful analysis of this class, we give efficient constructions of omnipredictors for interesting classes of loss functions, including non-convex losses. This decomposition highlights the utility of a new multi-group fairness notion that we call calibrated multiaccuracy, which lies in between multiaccuracy and multicalibration. We show that calibrated multiaccuracy implies Loss OI for the important set of convex losses arising from Generalized Linear Models, without requiring full multicalibration. For such losses, we show an equivalence between our computational notion of Loss OI and a geometric notion of indistinguishability, formulated as Pythagorean theorems in the associated Bregman divergence. We give an efficient algorithm for calibrated multiaccuracy with computational complexity comparable to that of multiaccuracy. In all, calibrated multiaccuracy offers an interesting tradeoff point between efficiency and generality in the omniprediction landscape.
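Optical Character Recognition (OCR), the process of extracting textual information from scanned documents, is a vital technology for digitizing or indexing physical documents. When a document is visually damaged or contains non-textual elements, existing technologies can yield poor results, because erroneous detection results can greatly affect OCR quality. In this paper, we present Businet, a text detection network aimed at OCR of business documents. Business documents often include sensitive information, so they cannot be uploaded to a cloud service for OCR. Businet is designed to be fast and lightweight so that it can run locally, avoiding privacy issues. Furthermore, Businet is built to handle corruption and noise in scanned documents using a specialized synthetic dataset. By employing an adversarial training strategy, the model is made robust to noise. We evaluate on publicly available datasets to demonstrate the usefulness and broad applicability of our model.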